1 Introduction

During the COVID-19 pandemic, subway ridership in New York City took a big hit. During the quarantine period people stayed indoors, and subway workers faced unprecedented hardship. The sudden outbreak brought unprecedented panic and anxiety. In a big city like New York, the subway is a necessary way for everyone to travel, and I want to show through visuals how much that changed. I compare the number of subway rides in 2020, from March, when the epidemic had just started, through late 2020, with 2018, when there was no epidemic. I also looked at the data to see whether there was any direct correlation between the number of confirmed cases during the pandemic and the number of passengers.

2 Data

2.1 Where did I get these data?

The original data are from the MTA (https://new.mta.info/coronavirus/ridership) and NYC Open Data (https://opendata.cityofnewyork.us).

I also found the data at https://qri.cloud/nyc-transit-data.


3 Ridership Comparison between 2018 & 2020

3.1 2020 daily ridership

#install.packages("ggplot2")
library(ggplot2)
#library(tidyverse)
#library(dplyr)
library(scales)
daily2020<-read.csv("https://raw.githubusercontent.com/GoldenSweet/data/main/dailytotal2020.csv")
a<-ggplot(daily2020,aes(date,entries))
a+geom_bar(stat = "identity",
             color="darkslategray3")+
  theme(plot.title = element_text(size=16,hjust = 0.5),axis.text.x=element_blank(),axis.ticks.x=element_blank()) +
  labs(title="2020 NYC Daily Ridership",
       x="2020 Jan  -   2020 Dec  ",
       y="Daily Entries")+
scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6) )

3.2 2018 NYC daily ridership

library(scales)
daily2018<-read.csv("https://raw.githubusercontent.com/GoldenSweet/data/main/dailytotal2018.csv")
a<-ggplot(daily2018,aes(date,entries))
a+geom_bar(stat = "identity",color="#F8766D")+
  theme(plot.title = element_text(size=16,hjust = 0.5),axis.text.x=element_blank(),axis.ticks.x=element_blank()) +
  labs(title="2018 NYC Daily Ridership",
       x="2018 Jan  -   2018 Dec  ",
       y="Daily Entries")+
scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6) )

3.3 Compare 2020 & 2018 daily together

#install.packages("reshape2")
library(reshape2)
library(ggplot2)
library(scales)
daily2020_2018<-read.csv("https://raw.githubusercontent.com/GoldenSweet/data/main/daily2020%262018.csv")

dailyentry18 <- daily2020_2018$X2018dailyentries
dailyentry20 <- daily2020_2018$X2020dailyentries
day <- daily2020_2018$Date
df1 <- data.frame(dailyentry18, dailyentry20, day)
df2 <- melt(df1, id.vars='day')
ggplot(df2, aes(x=day, y=value, fill=variable)) +
    geom_bar(stat='identity', position='dodge')+
   theme(plot.title = element_text(size=16,hjust = 0.5),axis.text.x=element_blank(),axis.ticks.x=element_blank()) +
  labs(title=" NYC Daily Ridership",
       x="Jan  -   Dec  ",
       y="Daily Entries")+
 scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6) )

As the graph shows, the daily comparison is messy and hard to read. That is because there are too many days: a year has 365 or 366 days, and each day gets its own bar. So I want a clearer way to compare the two years, and I will check whether aggregating the data by week produces a more readable graph.
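The weekly files used in the next section were prepared separately, and how they were built is not shown here. As a rough sketch of one way to roll daily totals up to weekly totals, assuming consecutive 7-day blocks starting from the first row (the toy data frame and its values are hypothetical, not the MTA data):

```r
# Toy daily series (hypothetical values, not the MTA data)
daily <- data.frame(
  date = seq(as.Date("2020-01-01"), as.Date("2020-01-28"), by = "day"),
  entries = rep(c(5e6, 5e6, 5e6, 5e6, 5e6, 3e6, 2e6), times = 4)
)

# Week index: rows 1-7 are week 1, rows 8-14 are week 2, and so on
daily$week <- (seq_len(nrow(daily)) - 1) %/% 7 + 1

# Sum the daily entries within each week
weekly <- aggregate(entries ~ week, data = daily, FUN = sum)
weekly
```

Grouping by calendar week instead (for example with format(date, "%U")) would give slightly different boundaries, so the choice should match how the weekly files were actually built.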

3.4 Compare 2018 & 2020 Weekly Ridership

weekly2018<-read.csv("https://raw.githubusercontent.com/GoldenSweet/data/main/weekly2018.csv")
weekly2020<-read.csv("https://raw.githubusercontent.com/GoldenSweet/data/main/weekly2020.csv")
weekentry18 <- weekly2018$weeklyentries
weekentry20<- weekly2020$weeklyentries
week <- weekly2018$X
df1 <- data.frame(weekentry18,weekentry20, week)
df2 <- melt(df1, id.vars='week')
ggplot(df2, aes(x=week, y=value, fill=variable)) +
    geom_bar(stat='identity',position = position_dodge(width = 1.8))+
   theme(plot.title = element_text(size=16,hjust = 0.5),axis.text.x=element_blank(),axis.ticks.x=element_blank()) +
  labs(title=" NYC Weekly Ridership",
       x="Jan  -   Dec  ",
       y="Weekly Entries")+
 scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6) )

Now the visualization is much clearer. There is a huge drop after the pandemic hit New York City. At the beginning of the pandemic, weekly ridership was only about 30% to 40% of the 2018 level.


3.5 A better look in Tableau

Tableau


4 Set Up Ordinary Least Squares (OLS) Linear Regression Model

4.1 What is the OLS model?

Ordinary Least Squares (OLS) linear regression is a statistical technique used for the analysis and modelling of linear relationships between a response variable and one or more predictor variables. If the relationship between two variables appears to be linear, then a straight line can be fit to the data in order to model the relationship. The linear equation (or equation for a straight line) for a bivariate regression takes the following form:

y = mx + c

where y is the response (dependent) variable, m is the gradient (slope), x is the predictor (independent) variable, and c is the intercept. The modelling application of OLS linear regression allows one to predict the value of the response variable for varying inputs of the predictor variable given the slope and intercept coefficients of the line of best fit.

The line of best fit is calculated in R using the lm() function which outputs the slope and intercept coefficients. The slope and intercept can also be calculated from five summary statistics: the standard deviations of x and y, the means of x and y, and the Pearson correlation coefficient between x and y variables.
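To make the link between lm() and the five summary statistics concrete, here is a minimal sketch on toy data (the x and y values are hypothetical and chosen only for illustration). It uses the standard identities slope = r * sd(y) / sd(x) and intercept = mean(y) - slope * mean(x):

```r
# Toy data (hypothetical values chosen only for illustration)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2)

# Slope and intercept from the five summary statistics
r  <- cor(x, y)                 # Pearson correlation coefficient
m  <- r * sd(y) / sd(x)         # slope
c0 <- mean(y) - m * mean(x)     # intercept (named c0 to avoid masking base::c)

# The same coefficients from lm(); coef() returns intercept first, then slope
fit <- lm(y ~ x)
coef(fit)

# The two approaches should agree up to floating-point tolerance
all.equal(unname(coef(fit)), c(c0, m))
```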

4.2 What do we do here?

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
daily2020 %>%
 ggplot(aes(x = cases, y = entries)) +
 geom_point(colour = "chocolate4")+
   geom_smooth(method = "lm", color="darkgoldenrod1",fill = NA)+
scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6) )
## `geom_smooth()` using formula 'y ~ x'

cor(daily2020$entries,daily2020$cases)
## [1] -0.3718981

The plot shows many points that look like outliers, because there were no COVID-19 cases at the beginning of 2020. Therefore, I create a new data frame that includes only the period after the pandemic started in New York City.
The result is the data frame real_data, which means we only look at the data from March 1 to the end of the year.

4.3 My real_data OLS Model

I only want data from March 1 to December 31.

real_data<-daily2020[60:365,]
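# Selecting rows by position works only if the file is complete and sorted by
# day. Because 2020 is a leap year, row numbers may not line up with calendar
# dates as expected. As a hedged alternative, the sketch below filters by date.
# It uses a toy data frame (toy and real_toy are hypothetical names) and
# assumes the date column holds ISO "YYYY-MM-DD" strings; the format in
# dailytotal2020.csv may differ.

```r
# Toy frame standing in for daily2020 (hypothetical values)
toy <- data.frame(
  date = as.character(seq(as.Date("2020-01-01"), as.Date("2020-12-31"), by = "day")),
  entries = 1
)

# Parse the dates, then keep March 1 onward
toy$date <- as.Date(toy$date)   # assumes ISO "YYYY-MM-DD" strings
real_toy <- toy[toy$date >= as.Date("2020-03-01"), ]

nrow(real_toy)                  # 306 days from March 1 to December 31
range(real_toy$date)
```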

Now we fit the OLS model on real_data.

library(dplyr)
real_data %>%
 ggplot(aes(x = cases, y = entries)) +
 geom_point(colour = "chocolate4")+
   geom_smooth(method = "lm", color="darkgoldenrod1",formula = y ~ x,fill = NA)+
scale_y_continuous(labels = unit_format(unit = "M", scale = 1e-6) )

cor(real_data$entries,real_data$cases)
## [1] -0.2222609
ols <- lm(real_data$entries ~ real_data$cases, data =real_data)
ols
## 
## Call:
## lm(formula = real_data$entries ~ real_data$cases, data = real_data)
## 
## Coefficients:
##     (Intercept)  real_data$cases  
##       1394934.5           -133.2
summary(ols)
## 
## Call:
## lm(formula = real_data$entries ~ real_data$cases, data = real_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1040180  -558719  -123610   389786  4336877 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1394934.47   65945.55  21.153  < 2e-16 ***
## real_data$cases    -133.18      33.51  -3.975 8.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 861000 on 304 degrees of freedom
## Multiple R-squared:  0.0494, Adjusted R-squared:  0.04627 
## F-statistic:  15.8 on 1 and 304 DF,  p-value: 8.81e-05

Here we should see a few things.

  1. R-squared

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent variable that’s explained by an independent variable or variables in a regression model. Whereas correlation explains the strength of the relationship between an independent and dependent variable, R-squared explains to what extent the variance of one variable explains the variance of the second variable. So, if the R2 of a model is 0.50, then approximately half of the observed variation can be explained by the model’s inputs.

For this model, R-squared is 0.0494, so the number of cases explains only about 4.9% of the variance in daily entries. The linear relationship is very weak.

  2. Residual standard error: 861000

The residual standard error estimates how well the model fits the data: it is the typical size of the residuals, in the same units as the response variable. The closer it is to 0, the better the fit, and a value of 0 would mean the model fits the data perfectly. Here the residual standard error is 861,000 entries, which is large relative to the intercept of about 1.39 million entries, so the model does not fit the data well.

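  3. P-value

The p-value for the slope (8.81e-05) is far below 0.05, so we reject the null hypothesis that the slope is zero. The negative relationship between cases and entries is statistically significant, but as the low R-squared shows, it is weak.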
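As a quick consistency check, the numbers in the summary output can be tied together by hand. For a regression with a single predictor, R-squared equals the squared Pearson correlation, the t value is the estimate divided by its standard error, and the F-statistic equals the squared t value. Using the rounded values printed above (so the last digit may differ slightly):

```latex
\begin{align*}
R^2 &= r^2 = (-0.2222609)^2 \approx 0.0494 \\
t   &= \frac{\hat{m}}{\mathrm{SE}(\hat{m})} = \frac{-133.18}{33.51} \approx -3.97 \\
F   &= t^2 \approx (-3.97)^2 \approx 15.8
\end{align*}
```

Here r is the correlation from cor(real_data$entries, real_data$cases), and the estimated slope and its standard error come from the real_data$cases row of the coefficient table.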


5 Conclusion

The regression model shows that NYC subway ridership has at most a weak linear relationship with the number of COVID-19 cases: cases explain only about 5% of the variation in daily entries.

As we know, New York is a very densely populated city, and most people travel by subway to go to work, shop, and get around. However, the outbreak in New York was initially the worst in the United States. New Yorkers were among the first to isolate and stay at home, and the city was also one of the first places to start working from home.

6 References

Hamidi, S., & Hamidi, I. (2021, January 26). Subway ridership, crowding, or population density: Determinants of covid-19 infection rates in New York City. American Journal of Preventive Medicine. Retrieved November 28, 2021, from https://www.sciencedirect.com/science/article/pii/S0749379721000593.

Ahangari, S., Chavis, C., & Jeihani, M. (2020, January 1). Public transit ridership analysis during the COVID-19 pandemic. medRxiv. Retrieved November 28, 2021, from https://www.medrxiv.org/content/10.1101/2020.10.25.20219105v1.

MTA, https://new.mta.info/.

Berger, P. (2020, September 29). New York City Subway is low risk for coronavirus transmission, study says. The Wall Street Journal. Retrieved October 18, 2021, from https://www.wsj.com/articles/new-york-city-subway-is-low-risk-for-coronavirus-transmission-study-says-11601388000.

Impact of the COVID-19 pandemic on subway ridership in New York City. Office of the New York State Comptroller. (n.d.). Retrieved October 18, 2021, from https://www.osc.state.ny.us/reports/osdc/impact-covid-19-pandemic-subway-ridership-new-york-city.

Whong, C. (2020, June 19). Taming the MTA’s unruly turnstile data. Medium. Retrieved October 18, 2021, from https://medium.com/qri-io/taming-the-mtas-unruly-turnstile-data-c945f5f96ba0.